TAILOR-APL: An Interactive Computer Program
for Individual Tailored Testing

Anyone  can  create  a test  simply  by  gathering  together  a set  of  questions. 

A good test will have some relevance to an ability or quantity of interest
and produce reliable scores for the full range of examinees for which it is
intended. Tailored testing methods are individualized testing methods which
make  use  of  the  fact  that  the  items  which  reliably  measure  an  attribute  of 
a single  individual  need  only  be  a small  subset  of  the  items  necessary  for 
measuring  a group. 

Rudimentary  tailoring  began  with  the  Binet  Intelligence  Test  and  exists 
in  other  individual  tests  such  as  the  Wechsler  Intelligence  Scales  (Wechsler, 
1958) and the Peabody Picture Vocabulary Test (Dunn, 1965). These tests begin
at levels of difficulty, estimated from age and other indicators, where a
specified number of items in a sequence or within an age level will all be
answered correctly, and continue to more difficult items until a long string
of errors marks the probable limit of success. It is assumed that items of
lower difficulty than those administered would have been answered correctly
and items of greater difficulty than those administered would have been
incorrect. Whether extending the range of the Stanford-Binet would produce
different scores was investigated by Bradway (1943), who found no significant
difference in absolute scores or score reliability.

The  first  systematic  attempt  at  tailoring  came  in  1946  in  a paper  by 
Cowden.  Cowden  applied  Wald's  (1947)  sequential  analysis  techniques,  originally 
used  for  industrial  product  testing,  to  the  special  problem  of  tailored 
testing when only two outcomes are possible (pass-fail, accept-reject,
hire-don't hire, etc.). Items which produce the greatest differentiation
between groups are presented in the same order for each examinee. As soon
as the responses from a particular individual allow him to be classed with
a specified certainty in either group, his test is ended. Sequential analysis
has been most recently applied to criterion referenced testing in computer
assisted instruction programs (Hood, 1970; Ferguson, 1969; Ferguson & Hsu,
1971).

Hick in 1951 suggested a rationale for tailored testing using ideas
from his work in signal detection and information theory. His notion was
that the item providing the maximum amount of information was that item which
an individual has a .50 probability of answering correctly. Consequently,
the initial item of a tailored test should be the item of mean difficulty
in the individual's population. If the first question is answered correctly,
the second item should be answered correctly by .50 of the people who answered
the first item correctly, and so on.

Perhaps the farthest strategy from Hick's ideal, but the easiest tailoring
system to administer without a computer, is the "two stage" test (Angoff
& Huddleston, 1958; Cleary, Linn & Rock, 1968a, 1968b; Linn, Rock & Cleary,
1969; Lord, 1971a). The two stage test has an initial test, often much shorter
than the second, which routes all the examinees according to their scores to
a final test appropriate to their ability levels. A variety of such compromise
strategies exist between the systems which branch after every item and
a conventional test. The number of stages and items available at each stage
vary, as do the scoring and branching rules. Lord alone investigated 200
approaches to two stage testing.

The next type of system has been the most prolific (for a large list of
sources and an excellent overview of tailored testing see Weiss & Betz, 1973).
This approach structures the item pool into a decision tree that looks similar
to Pascal's triangle. Instead of providing two unique items for each branching,
the branches form a lattice of reconverging paths. A person whose answers
are r-r-w-w-w ends up in the same place as a person whose answers are w-w-w-r-r
or any other combination of as many right and wrong answers. Instead of the
$2^n - 1$ items which would be required for the structured item pool of a test
in which n items are actually presented, if each branch were unique, the
reconverging structure requires $n(n + 1)/2$ items. Both totals are excessive
for all but very short tests. There are other problems with the structured
item pool, such as fitting the size of the branching steps up and down (or
rather from side to side) within the order of difficulty to conform with Hick's
prescription. To get the proper conditional probabilities of a correct
response at each juncture, a shrinking step size is necessary, which is
incompatible with the equal size units of the reconverging triangle.
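
The two item pool totals can be compared directly. The following is a minimal sketch; the function names are illustrative and do not come from any of the cited programs.

```python
def unique_branch_items(n: int) -> int:
    """Items needed when every branch of an n-item test is unique: 2^n - 1."""
    return 2 ** n - 1


def reconverging_items(n: int) -> int:
    """Items needed when paths reconverge into a triangle: n(n + 1)/2."""
    return n * (n + 1) // 2


# Print both totals for a few test lengths.
for n in (5, 10, 20):
    print(n, unique_branch_items(n), reconverging_items(n))
```

Even the smaller total grows quadratically with the number of items presented, which is the difficulty noted above.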

One  researcher  (Mussio,  1972)  attempted  to  reduce  the  inordinate  item 
pool requirements by truncating the lower corners of the triangle, but the
most  satisfactory  solution  to  the  many  problems  of  a structured  item  pool  is 
to  unstructure  it. 

An unstructured item pool is primarily one dimensional, with the possibility
of item discrimination being used as a second dimension. Items are
chosen  at  each  stage  in  a tailored  test  according  to  their  actual  difficulty, 
which  may  not  be  the  ideal  .50,  but  will  be  as  close  to  it  as  possible. 

Lord (1971b) implements an unstructured method which he calls the
"flexilevel" test. Beginning with the item of expected .50 difficulty, the
examinee branches up or down one item for each correct or incorrect response.
When an individual is forced to double back and confronts items which they have
already taken, these are simply skipped over. The test is ended when half the
items have been taken. The result of this technique is that the examinee
takes the half of the item pool closest to their ability level. Although far
from ideal, it is a very easy test to administer and the individual's score
is simply the number correct.
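
The branching rule just described can be sketched as follows. This is an illustration under stated assumptions, not Lord's program: items are assumed to be indexed from easiest (0) to hardest, the pool size is assumed odd so that "half the items" is (n + 1)/2, and the callable `answers_correctly` is a hypothetical stand-in for the examinee.

```python
from typing import Callable, List, Tuple


def flexilevel(n_items: int,
               answers_correctly: Callable[[int], bool]) -> Tuple[int, List[int]]:
    """Return (number correct, items administered in order)."""
    if n_items % 2 == 0:
        raise ValueError("this sketch assumes an odd number of items")
    middle = n_items // 2
    easier, harder = middle - 1, middle + 1   # next untaken item on each side
    current = middle
    taken: List[int] = []
    score = 0
    for _ in range((n_items + 1) // 2):       # stop after half the pool
        taken.append(current)
        if answers_correctly(current):
            score += 1
            # Move up; items already taken are skipped automatically because
            # `harder` always points at the next untaken harder item.
            current, harder = harder, harder + 1
        else:
            current, easier = easier, easier - 1
    return score, taken


# Hypothetical examinee who passes every item easier than item 6.
result = flexilevel(9, lambda item: item < 6)
print(result)
```

Keeping one pointer on each side of the starting item is what implements the "skip over items already taken" rule without any explicit search.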

Among  the  unstructured  approaches  are  three  methods  that  use  extensive 
calculations after each response to determine the best item available for
presentation. 

Novick (1969) has suggested a possible Bayesian method of tailored testing.
Beginning with a population distribution as the initial prior, the
appropriate impact of a correct or incorrect answer to each question is seen
in the posterior distributions as a narrowing of the variance and a movement
of the mean to a higher value for correct responses and to a lower value
for incorrect responses. Each item is chosen to give the maximum reduction
of variance. The posterior distribution from each item becomes the prior
distribution for the next. The process is continued until the posterior
variance is less than a predetermined maximum.

Owen (1969, 1970) has produced a Bayesian algorithm for actual implementation
which involves a simplifying assumption of normal priors. In addition,
Owen has incorporated a method of dealing with the possibility of the
correct answer in a multiple choice format being guessed.

Urry  (1970)  proposed  a maximum-likelihood  method  of  doing  very  much  the 
same  thing  as  the  Bayesian  procedures  do,  but  without  a prior  distribution. 
Rather than assuming a flat prior, or a population prior as the Bayesian
methods do, the maximum-likelihood methods establish an initial probability
distribution on the basis of one correct and one incorrect response. In
order to do this, items are presented at the beginning of the procedure that
are at extreme positions in the difficulty continuum. Such items have very
low information value and therefore represent an inefficiency in
maximum-likelihood methods.

In  the  case  of  maximum-likelihood  approaches  an  operational  system  was 
provided by Reckase (1974). As in the implementation of Bayesian techniques,
simplifying assumptions were made to reduce the complexity of necessary
computations. The Reckase procedure is based on the Rasch model (Rasch, 1960)
which  treats  all  items  as  if  they  had  equal  discrimination  and  makes  no 
allowance  for  guessing. 

All of the preceding methods of tailored test administration begin with
knowledge of the difficulty levels of the various items based on previous
conventional testing. The more elegant methods also require calculation of item
discrimination and guessing probabilities. The accuracy of the results depends
on the extensiveness of pretesting. Gugel, Schmidt, and Urry (1976) analyze
the results obtained from Owen's method using a range of from 500 to 2,000
pretest examinees.

Applying  these  methods  to  well  established  tests  which  have  already  been 
extensively  pretested  would  not  be  difficult,  but  adding  items  would  be  a slow 
process. For the majority of tests which are not maintained for thousands
of examinations, the effort of pretesting is likely to be prohibitive. There
is also the possibility that the sheer volume of pretesting would encourage
the use of item parameters estimated from the responses of an inappropriate
sample.

Implied  Orders 

The research to be reported in this paper was undertaken to produce and
evaluate by live testing a test tailoring mechanism described by Cliff (1975),
called TAILOR in its group computer program form (Cudeck, Cliff & Kehoe, 1977)
and TAILOR-APL in the individual testing form (McCormick & Cliff, 1977), which
is the version evaluated here.

As outlined by Cliff (1975), the origins of TAILOR are in ordinal scaling,
and its approach to test tailoring emphasizes the order relations among persons
and items. In addition to an emphasis on order relations, TAILOR presents a
unique solution to the problem of gathering item information. The program
begins with no knowledge of item difficulties or other item characteristics and
makes the collection of item information part of test administration. The
efficiency of tailoring at any time is therefore determined by the thoroughness
of the information so far collected. In this way significant tailoring
can be enjoyed by examinees who would have been forced by the pretesting
requirements of all other tailoring programs to take complete tests.

The series of matrix operations which define TAILOR take place in the
context of an expanded person-item binary score matrix. This is depicted in
Figure 1. Instead of a conventional persons x items matrix in which the
nonzero entries represent successes of persons, characterized as rows, with items,
represented by columns, the persons + items x persons + items matrix represents
four types of relations. The intersection of a person's row with an item's
column can mean what it did before, but can also represent an answer that was
implied to be correct based on the individual's previous answers rather than
being an actual response to an administered item. The intersection of an item's
row with a person's column, if nonzero, represents either a wrong answer or an
answer which was implied to be wrong. Person-person and item-item intersections
have no possibility of being directly observed and must be implied from
person-item observations. Person-person and item-item entries refer to
significant superiority of row over column either in ability or difficulty. It
is convenient to think in terms of a joint ordering of persons and items on
the ability-difficulty continuum and to refer to all relations as dominances.
It is also convenient to order items and persons arbitrarily to create four
submatrices which contain the four different types of relations, as shown
in Figure 2.

When all relations are determined, the person-item and item-person
matrices (wins and losses) represent the same information and are matrix
transpose-complements of each other.

Deriving Item-Item and Person-Person Dominances from Observed Responses

We begin a tailored test by observing correct and incorrect responses
to person-item pairs. This information can be recorded as ones and zeroes
in the expanded binary score matrix, A. Multiplying the persons + items x
persons + items matrix by itself is the first step in the implication process.
This produces a matrix, $A^2$, with entries only in the person-person and
item-item intersections:


[Matrix display: numeric entries not recoverable from the source.]

Figure 2

The entry in a person-person intersection represents the number of items which
were answered correctly by the row person and answered incorrectly by the column
person. Items which both missed or both answered correctly produce no entries
and are of no use in establishing the ability order between them. In a similar
fashion, entries in the item-item submatrix represent persons who are dominated
by the row item and dominate the column item.
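
The construction of the expanded matrix and its square can be sketched for a small complete data set. This is an illustration only: the function names are hypothetical, the data are invented, and placing the item rows before the person rows is an assumption about the layout of the expanded matrix.

```python
def matmul(X, Y):
    """Ordinary integer matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


def expanded_matrix(S):
    """Build the (items + persons) square matrix from a persons x items
    matrix S of observed answers (1 = correct), assuming complete data."""
    n_persons, n_items = len(S), len(S[0])
    # Item-person "losses": the transpose complement of S.
    S_prime = [[1 - S[p][i] for p in range(n_persons)] for i in range(n_items)]
    item_rows = [[0] * n_items + S_prime[i] for i in range(n_items)]
    person_rows = [S[p] + [0] * n_persons for p in range(n_persons)]
    return item_rows + person_rows


# Three persons (rows) by three items (columns); invented data.
S = [[1, 1, 1],
     [1, 1, 0],
     [1, 0, 0]]
A = expanded_matrix(S)
A2 = matmul(A, A)
for row in A2:
    print(row)
```

In the printed square, the upper left block counts, for each pair of items, the persons who missed the row item and passed the column item; the lower right block counts, for each pair of persons, the items the row person passed and the column person missed. The two off-diagonal blocks are zero.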

Testing Corresponding Entries for "Significant" Dominance

To establish the binary order relations once the integer dominance matrix
has been computed, each entry representing dominance in one direction is compared
to the entry representing the reverse dominance. The statistical rules for
deciding which element dominates the other, or whether no dominance can be
established, are divided into two approaches. The first approach handles cases
where more than one dominance has been observed between two elements. The
second approach is designed specifically for the instance where a single dominance
has been recorded in one direction and no counter-dominance in the opposite
direction.

Looking at the relationship between two people, the number of items missed
by person i and answered correctly by person j is compared to the number answered
correctly by person i and missed by person j according to McNemar's formula
for determining the significance of differences between correlated proportions
(Guilford & Fruchter, 1973, p. 165).

McNemar's Test

Based on items which both persons have taken.

                        Person j
                      won    lost
    Person i   won     A      B
               lost    C      D

$$Z_{(i>j)} = \frac{B - C}{\sqrt{B + C}}$$


If the Z statistic produced by McNemar's test exceeds 1.0, "significant" dominance
is recorded by entering a one at the appropriate intersection of a binary
person-person submatrix of the persons + items x persons + items matrix.
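
The McNemar criterion can be sketched as a small function. The names below are illustrative; `b` is the number of items won by person i and lost by person j, and `c` is the reverse count. Returning zero when no discordant items exist is an assumption of this sketch, not something stated in the text.

```python
import math


def mcnemar_z(b: int, c: int) -> float:
    """Z = (b - c) / sqrt(b + c), computed from the discordant counts only."""
    if b + c == 0:
        return 0.0  # assumption: no discordant items means no evidence
    return (b - c) / math.sqrt(b + c)


def significant_dominance(b: int, c: int, criterion: float = 1.0) -> bool:
    """True when the Z statistic exceeds the criterion (1.0 in the text)."""
    return mcnemar_z(b, c) > criterion


print(significant_dominance(2, 0))  # two dominances, no counter-dominance
print(significant_dominance(1, 0))  # the one-zero case
```

With a single dominance and no counter-dominance the statistic equals 1.0 exactly and so does not exceed the criterion, which is consistent with the one-zero case needing separate treatment.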

In Monte Carlo investigations of simulated testing which used a group
testing Fortran version of TAILOR, it was found that the discrete jump from
no implications to the case where one dominance is observed in a single
direction did not allow adequate precision in establishing significance. If the
one-zero case were allowed, too many false implications flooded the matrix. If
the one-zero case were not allowed, very little tailoring occurred. An
intermediate criterion, a second significance test, was developed on the basis of
binomial probabilities specifically to handle one-zero cases. The equation for
this criterion is:


P

N = number of items
I = person i's total items correct
J = person j's total items wrong


The significance test is based on the row of S (the correct answer or wins
matrix) which contains all of the successful individual's wins including the
current item. Also used for the test is the column of S' which contains the
losses of the unsuccessful person. In the instance of a one-zero case, one of
the wins in the dominant person's row corresponds to a loss in the dominated
individual's column. In order to determine the significance of this
correspondence, the binomial probability is calculated for the event that the wins
in the row and the losses in the column would form no correspondence if
randomly distributed. If the probability of no correspondence is higher than .5,
the dominance relation is retained as a one in the binary version of the
person-person submatrix, $A^2$ (binary).

In the first row of $A^2$ (shown below) there are two instances of total
dominances greater than one. There are two dominances of item one over item four
and three of item one over item six. Both cases are handled by McNemar's test
and, since no counter-dominances exist, both were maintained in the binary
version of $A^2$, which is $\hat{A}^2$. (The symbol $\hat{\ }$ shall represent the
conversion of integer entries to binary relations.) Also in the first row are
three instances where a single dominance exists of item one over another item.
Again, no counter-dominances exist. Because only a single dominance is involved,
these cases are handled by the binomial probability procedure. The probability
that the first two might have occurred by chance is less than .5, so they are
retained in $\hat{A}^2$. The dominance of item one over item five has better than
a .5 probability of occurring singly by a random assortment of the persons who
missed item one and those who answered five correctly, so it is not retained.

The effect of the binomial probability test of significant dominance is
to limit implications in the one-zero case to the earlier stages of testing
when information is scarce. As the vectors fill, the probability of a random
one-zero correspondence increases and stronger evidence is required.

The procedure for determining significant order relations between two
items proceeds analogously. Integer products of the first multiplication of
the persons + items matrix by itself are tested for significant dominance
whether they are item-item dominances or person-person dominances. Item-item
entries are the result of persons who were dominated by the row item and in
turn dominated the column item.

Determining  Higher  Order  Relations 

After significance testing there are item-item and person-person relations
recorded as binary entries. Analogous to the process of determining these
relations by looking at items which were common to each pair of persons and
persons who were common to each pair of items, item-item or person-person
relations can be established by looking at items which share relationships
with other items and persons who are shared in relationships with pairs of
persons. For instance, if person A dominates persons B, C and D on the basis
of items common to each one and A, and B, C and D all dominate person E, it
can be implied that person A dominates person E, assuming there are no such
implications in the opposite direction. The new implications based on these
person-person-person or item-item-item chains of implications are arrived at
by significance testing of the integer products of the binary matrix containing
person-person and item-item relations multiplied by itself, $A^4$. Repeated
powering would produce entries representing longer and longer chains of
implication, but the empirical observation has been made that few useful
implications are made beyond the first powering of the matrix. The new higher
order relations are then combined with the relations of $\hat{A}^2$ according to
the rules of Boolean addition (denoted $\oplus$).

0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0


0 1 1 1
0 0 1 0
0 0 0 1
0 0 0 0

0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0


0 1 1 1
0 0 1 1
0 0 0 1
0 0 0 0
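
The powering and Boolean addition step can be sketched on a small 4 x 4 set of binary relations. This is an illustration under a loud assumption: the significance tests are replaced here by a simple stand-in that keeps an entry whenever it exceeds its counterpart in the opposite direction. All function names are hypothetical.

```python
def matmul(X, Y):
    """Ordinary integer matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


def to_binary(M):
    """Stand-in for significance testing (an assumption of this sketch):
    keep entry (i, j) only if it exceeds the reverse entry (j, i)."""
    n = len(M)
    return [[1 if M[i][j] > M[j][i] else 0 for j in range(n)]
            for i in range(n)]


def boolean_add(X, Y):
    """Elementwise logical OR of two binary matrices."""
    return [[1 if (x or y) else 0 for x, y in zip(rx, ry)]
            for rx, ry in zip(X, Y)]


# Binary first-order relations among four elements.
R = [[0, 1, 1, 1],
     [0, 0, 1, 0],
     [0, 0, 0, 1],
     [0, 0, 0, 0]]
higher_order = to_binary(matmul(R, R))   # two-step chains of dominance
combined = boolean_add(R, higher_order)
for row in combined:
    print(row)
```

The only relation added by the powering is that of the second element over the fourth, which follows from the chain through the third element.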


Implying Person-Item Relations from Observed Scores

Once person-person and item-item binary relations are established, implied
person-item relations, corresponding to right or wrong answers to test
questions, can be determined through common items or common persons. The chains
of implication are derived by multiplication of the original matrix of observed
scores, A, by an expanded matrix, $(A^2 \oplus A^4)$, containing only binary
person-person and item-item relations. Each person-item and item-person integer
entry is tested for significant dominance over its counterpart and the final
relations preserved as binary entries.

$$(A^2 \oplus A^4)A$$

The resulting matrix, $(A^2 \oplus A^4)A$, is the matrix of implied and observed
correct and incorrect responses, which is then combined through Boolean
addition with the original observed person-item and item-person responses, A.

Because it is possible to imply an answer to a person-item pair which
contradicts the already observed response, a provision of the program at
this stage prohibits such implications.

If we combine the binary matrix with person-person and item-item
relations with the new binary matrix of person-item and item-person
relations we get an expanded persons + items x persons + items binary
matrix which represents all our information. Each person and item
has a row containing its dominances and a column with an entry for
each person or item by which it is dominated. The row total minus
the column total is a net dominance score which can be used to pair
persons with items close to their level on the ability/difficulty
continuum.
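
The net dominance score and its use in pairing can be sketched as follows. The element ordering, the invented dominance matrix, and the function names are assumptions for illustration; in particular, choosing the untaken item with the closest score is one plausible reading of "pairing", not a documented rule of the program.

```python
def net_dominance(M):
    """Row total minus column total for every element of a square matrix."""
    n = len(M)
    row_totals = [sum(row) for row in M]
    col_totals = [sum(M[i][j] for i in range(n)) for j in range(n)]
    return [r - c for r, c in zip(row_totals, col_totals)]


def closest_item(person, items, scores, already_taken=()):
    """Pick the untaken item whose net dominance is nearest the person's.
    Ties go to the first candidate in `items`."""
    candidates = [i for i in items if i not in already_taken]
    return min(candidates, key=lambda i: abs(scores[i] - scores[person]))


# Elements 0-2 are items, 3-4 are persons. Invented complete ordering:
# item 2 > person 4 > item 1 > person 3 > item 0.
M = [[0, 0, 0, 0, 0],
     [1, 0, 0, 1, 0],
     [1, 1, 0, 1, 1],
     [1, 0, 0, 0, 0],
     [1, 1, 0, 1, 0]]
scores = net_dominance(M)
print(scores)
print(closest_item(4, [0, 1, 2], scores, already_taken={1}))
```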

Ignoring guessing effects, an item at a person's level of ability
has a .5 probability of being answered either correctly or incorrectly.
This is the maximum degree of uncertainty that can exist about the
outcome of a person-item confrontation. Items of greater or lesser
difficulty and greater or lesser probability of being answered
correctly are to some extent predictable, and, if we consider information
to be the resolution of uncertainty, they represent lower information
value. The richest source of information about person-item outcomes,
then, is pairs of persons and items closely matched in ability/difficulty.
Looked at in terms of the number of binary relations resolved,
items which readily discriminate between adjacent persons in the
person order create complete sets of relations for those persons. If
two persons cannot be reliably ordered, then that binary relation will
be missing from the overall matrix. For a given individual the most
informative items are those which separate him from his closest
neighbors. If we can differentiate each person from those closest to him,
in the process we will collect the information necessary to differentiate
him from everyone else.

We have already seen how dominance relations between persons and
items, items and items, and persons and persons can be chained together
to imply new relations, including implied correct and incorrect answers.
The total number of binary relations within an ordered set of elements
is $n(n - 1)/2$. In a binary dominance matrix containing n x n elements,
each element $x_{ij}$ has a complementary element $x_{ji}$ which expresses the
same relation. The n diagonal elements are unable to express dominance,
so we are left with $n(n - 1)/2$ elements. Most of these relations
can be expressed in a variety of different ways. If we use the alphabet
as an example, the matrix showing its binary order relations would
have an entry for D follows A; entries for C follows A and D follows C;
entries for B follows A and D follows B; or the set of B follows A, C
follows B, and D follows C. Four sets of relations, then, tell us the order
between A and D. The order of any two elements can be implied from
a number of chains of implications equal to the number of combinations
possible using the intervening elements. Only the order of adjacent
letters is not multiply determined. If the order of B and C is missing,
there is no way to determine it from the remaining intersections. On
the other hand, if we know the order of all n - 1 adjacent letter pairs
the rest follow by implication. The order relations between adjacent
or nearly adjacent elements can be used as building blocks to construct
more distant relations, but the reverse is not the case. Distant
relations provide little information about the other elements and
are easily derived from multiple sources. For this reason items
and persons are chosen for their proximity on the ability/difficulty
continuum.
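
The alphabet example can be checked by enumerating the chains directly. This is a minimal sketch; the function name is illustrative.

```python
from itertools import combinations


def implication_chains(order, first, last):
    """List every chain from `first` to `last` that passes through some
    subset of the elements lying between them in `order`."""
    i, j = order.index(first), order.index(last)
    between = order[i + 1:j]
    chains = []
    for k in range(len(between) + 1):
        for middle in combinations(between, k):
            chains.append([first, *middle, last])
    return chains


for chain in implication_chains("ABCD", "A", "D"):
    print(" < ".join(chain))

# Adjacent letters have no intervening elements, so only the direct
# relation is available.
print(len(implication_chains("ABCD", "B", "C")))
```

With k intervening elements there are 2^k chains, so the number of alternative derivations grows quickly with distance while adjacent pairs have only the direct relation.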

On one hand we direct our observations of actual person-item
relations to areas of greatest usefulness by matching the net
dominance scores of persons and items. On the other hand, distant
relations in the person-item matrices are being filled in by implication.
When the two processes converge and all person-item relations are
determined, the test is over.

To summarize the implication process in matrix terms, consider
the expanded matrix A, arbitrarily divided into submatrices:

$$A = \begin{bmatrix} I & S' \\ S & P \end{bmatrix}$$

where: I = item-item relations
       P = person-person relations
       S = person-item relations
       S' = transpose complement of S (for complete data).

Observed correct and incorrect answers are recorded in S and S'
respectively:

$$A = \begin{bmatrix} 0 & S' \\ S & 0 \end{bmatrix}$$

I and P are null matrices.

Person-person and item-item relations are provided by AA.

$$AA = \begin{bmatrix} S'S & 0 \\ 0 & SS' \end{bmatrix} = \begin{bmatrix} I & 0 \\ 0 & P \end{bmatrix}$$

By significance testing I and P are transformed from integer
products of S and S' into binary matrices. Further person-person
and item-item relations are implied from the squared matrices $I^2$ and
$P^2$. These also become binary after significance testing.

$$AAAA = \begin{bmatrix} I^2 & 0 \\ 0 & P^2 \end{bmatrix}$$

These implications are combined with the original item-item
and person-person implications by Boolean addition.

$$AAAA \oplus AA = \begin{bmatrix} I \oplus I^2 & 0 \\ 0 & P \oplus P^2 \end{bmatrix}$$

The result is multiplied by the original entries in S and S'
and significance tested.

$$A(A^2 \oplus A^4) = \begin{bmatrix} 0 & S'(P \oplus P^2) \\ S(I \oplus I^2) & 0 \end{bmatrix}$$

The resulting binary implications are added, in Boolean addition,
to the original person-item and item-person matrices with
the provision that actual answers cannot be replaced or contradicted
by implied relations.

$$A \oplus A(A^2 \oplus A^4) = \begin{bmatrix} 0 & S' \oplus S'(P \oplus P^2) \\ S \oplus S(I \oplus I^2) & 0 \end{bmatrix}$$


Although  these  matrix  equations  involve  steps  such  as  the 
reduction of integer matrices to binary matrices by significance
testing and provisions for maintaining the original correct and
incorrect  responses,  they  can  still  be  manipulated  mathematically 
if  the  results  are  examined  cautiously. 

If we were dealing with simple matrix equations, it could be
noted that:

$$S(I \oplus I^2) = S(S'S \oplus S'SS'S) = SS'S \oplus SS'SS'S$$
$$S'(P \oplus P^2) = S'(SS' \oplus SS'SS') = S'SS' \oplus S'SS'SS'$$

Compare those results with two new equations:

$$(P \oplus P^2)S = (SS' \oplus SS'SS')S = SS'S \oplus SS'SS'S$$
$$(I \oplus I^2)S' = (S'S \oplus S'SS'S)S' = S'SS' \oplus S'SS'SS'$$

Therefore:

$$(P \oplus P^2)S = S(I \oplus I^2)$$
$$(I \oplus I^2)S' = S'(P \oplus P^2)$$
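
The two identities can be checked numerically in the "simple matrix equation" sense, that is, with ordinary addition standing in for Boolean addition and with no significance testing between steps. Both simplifications are assumptions of this sketch, and the data are invented.

```python
def matmul(X, Y):
    """Ordinary integer matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


def add(X, Y):
    """Ordinary elementwise sum, used here in place of Boolean addition."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


# Two persons by three items (1 = correct); invented complete data.
S = [[1, 1, 0],
     [1, 0, 0]]
# S' is the transpose complement of S: items by persons, 1 = wrong.
S_prime = [[1 - S[p][i] for p in range(len(S))] for i in range(len(S[0]))]

I = matmul(S_prime, S)   # item-item products, 3 x 3
P = matmul(S, S_prime)   # person-person products, 2 x 2

left_1 = matmul(add(P, matmul(P, P)), S)
right_1 = matmul(S, add(I, matmul(I, I)))
left_2 = matmul(add(I, matmul(I, I)), S_prime)
right_2 = matmul(S_prime, add(P, matmul(P, P)))
print(left_1 == right_1, left_2 == right_2)
```

Both comparisons hold by associativity of matrix multiplication. Once entries are reduced to binary by significance testing between the steps, the two routes need no longer agree exactly.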

The  information  brought  to  the  step  of  implying  right  and 
wrong answers by $(I \oplus I^2)$ and $(P \oplus P^2)$ would be equivalent except
for the intervening processes just mentioned. There is a rough
equivalence, with the larger of the two matrices producing more
implications when the program is operated with both matrix
calculations.

For reasons of economy, only the item-item matrix is used in
the implication process because it is generally larger than the
person-person matrix, it maintains a constant size and can be reused
with different persons. Person-person relations can still be
derived after the person-item implications have been made by
multiplying the final person-item matrix times the item-person matrix.

The shortened procedure proceeds as follows:


Elements of S and S' are observed.

Each element $I_{jk}$ is tested for significant dominance of a
corresponding element $I_{kj}$. A binary entry is retained in I for
each dominance.

I is squared: $II = I^2$


The elements of $I^2$ are tested for significant dominances and
the binary entries retained in $I^2$ are combined with those from I
by Boolean addition.
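
The steps up to this point can be sketched in code. This is an illustration only: the significance test is not specified in this section, so the sketch substitutes a simple margin rule (`margin`) as a placeholder assumption, and all function names and data are hypothetical. S is taken as persons by items (right answers) and `S_prime` as items by persons (wrong answers), so that entry (j, k) of their product counts persons who failed item j and passed item k.

```python
def matmul(a, b):
    """Ordinary integer matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def dominance(counts, margin):
    """1 where counts[j][k] exceeds counts[k][j] by at least `margin`.

    The margin rule is a stand-in (an assumption), not TAILOR's actual
    significance test.
    """
    n = len(counts)
    return [[1 if counts[j][k] - counts[k][j] >= margin else 0
             for k in range(n)] for j in range(n)]


def boolean_add(a, b):
    """Element-wise Boolean sum (logical OR) of two binary matrices."""
    return [[1 if x or y else 0 for x, y in zip(ra, rb)]
            for ra, rb in zip(a, b)]


def item_dominance(S, S_prime, margin=1):
    """Binary I combined with binary I^2 by Boolean addition."""
    I = dominance(matmul(S_prime, S), margin)   # binary item-item matrix
    I2 = dominance(matmul(I, I), margin)        # binary squared matrix
    return boolean_add(I, I2)


# Hypothetical data: 4 persons, 3 items ordered from easy to hard.
S = [[1, 0, 0],
     [1, 1, 0],
     [1, 1, 0],
     [1, 1, 1]]
S_prime = [[0, 0, 0, 0],
           [1, 0, 0, 0],
           [1, 1, 1, 0]]
D = item_dominance(S, S_prime, margin=1)
```

With these data the harder items come out dominating the easier ones; a real application would replace `dominance` with whatever test of significant dominance the program actually uses.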

The Boolean sum is then premultiplied by S to give implied
person-item dominance relations and postmultiplied by S' to give
item-person dominance relations.


Implied I-P Relations

Each element in both implied dominance matrices is tested for
significant dominance of its counterpart in the other matrix.

The binary results are added in Boolean fashion to S and S' with
the provision that actual answers are not contradicted by implied
entries.

Person-person relations are then derived by significance testing
the product of the implied and observed right and wrong answer
matrices:


Implied P-I Relations
These three matrices and $I \oplus I^2$ are then used in the determination
of new dominance scores.


The  score  for  any  item  or  person  is  its  row  total  minus  its 
column  total. 

The next item presented to each individual is the item which
has a net dominance score closest to his own. Excluded from the
items considered are all those which have been previously answered
or whose answers are already implied. The above operations were
originally devised for group testing.
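
A minimal sketch of the scoring and item selection rules just described follows. It assumes a square binary dominance matrix in which entry (a, b) is 1 when element a dominates element b; the function names, the tie-breaking behaviour (first candidate wins) and the example data are hypothetical.

```python
def net_dominance(D, idx):
    """Row total minus column total for element `idx` of matrix D."""
    row_total = sum(D[idx])
    column_total = sum(row[idx] for row in D)
    return row_total - column_total


def next_item(person_score, item_scores, unavailable):
    """Index of the available item whose score is closest to the person's.

    `unavailable` holds indices of items already answered or implied.
    Returns None when no item remains, i.e. the test is finished.
    """
    candidates = [j for j in range(len(item_scores)) if j not in unavailable]
    if not candidates:
        return None
    return min(candidates, key=lambda j: abs(item_scores[j] - person_score))


# Hypothetical three-element chain: 0 dominates 1 and 2, 1 dominates 2.
D = [[0, 1, 1],
     [0, 0, 1],
     [0, 0, 0]]
scores = [net_dominance(D, i) for i in range(3)]

# A person scoring 0 with item 1 already answered is sent to item 2.
chosen = next_item(0, [-4, -1, 1, 5], {1})
```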

This method of test tailoring has been under evaluation (Technical
Report 4) using simulated group testing and a Fortran version of TAILOR
(Cudeck, R. A., Cliff, N., and Kehoe, J., 1977).

In a complex variety of circumstances, TAILOR produced an average
correlation with true scores equal to .96 of the complete test
correlation with true scores and used an average of 56% of the items.

TAILOR-APL


TAILOR-APL is not identical in its operation to the group testing
model. The differences consist mainly of computational shortcuts which are
possible when information is being gathered for a single individual at a time.
In a group test, information is gathered about item order quickly so that
even the first person finished will take a test that has been tailored on the
basis of other people's responses. The first person to take an individually
administered test must answer all the items because there is no item
information yet available. As will be shown later, it may take only a few tests
to begin extensive tailoring.

Individual testing is more appropriate than group testing for the
largest potential application of tailored testing, which is in conjunction with
computer assisted instruction. McKillip and Urry (1976) of the U. S. Civil
Service, in their discussion of the advantages of computer administered tailored
tests, mention the ability to administer individual tests on a walk-in basis.

In implementing a version for individual testing the following economies
seemed reasonable:

Because the impact of individual answers on the $I \oplus I^2$ matrix is likely
to be small, it is only calculated at the end of each test before results are
output. Also, during each test, implied answers, net dominance scores and
implied person dominances based on $P^2$ are calculated only for the present
examinee.

Directly observed correct and incorrect responses are recorded in S
and S'.

$I \oplus I^2$ from the examinees already tested (a null matrix if this is the
first test) is pre- and postmultiplied by the individual's vectors in S and
S' to obtain implied right and wrong answers after significance testing.




The significant binary counterparts of IRA (Implied Right Answers) and
IWA (Implied Wrong Answers) are added to the individual's vectors in the
versions of S and S' that also contain implied responses, with the provision that
actual answers are not contradicted.
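
The implication step for a single examinee can be sketched as follows. This is a hedged illustration, not the APL code: `D` stands for a binary item dominance matrix in which `D[j][k]` is 1 when item j dominates (is harder than) item k, and the comparison of IRA with IWA uses a simple margin as a placeholder for the unspecified significance test. All names are hypothetical.

```python
def implied_answers(right, wrong, D, margin=1):
    """Add implied right and wrong answers to one examinee's vectors.

    right[k] and wrong[k] are 0/1 flags for observed answers to item k.
    """
    n = len(D)
    # IRA[j]: how many passed items dominate item j (right vector times D).
    ira = [sum(right[k] * D[k][j] for k in range(n)) for j in range(n)]
    # IWA[j]: how many failed items are dominated by item j (D times wrong).
    iwa = [sum(D[j][k] * wrong[k] for k in range(n)) for j in range(n)]

    new_right, new_wrong = list(right), list(wrong)
    for j in range(n):
        if right[j] or wrong[j]:
            continue  # actual answers are never replaced or contradicted
        if ira[j] - iwa[j] >= margin:
            new_right[j] = 1
        elif iwa[j] - ira[j] >= margin:
            new_wrong[j] = 1
    return new_right, new_wrong


# Hypothetical pool of four items forming a perfect chain, easiest first.
D = [[0, 0, 0, 0],
     [1, 0, 0, 0],
     [1, 1, 0, 0],
     [1, 1, 1, 0]]
# The examinee passed item 1 and failed item 2.
with_implied = implied_answers([0, 1, 0, 0], [0, 0, 1, 0], D)
```

In this example the easier item 0 is implied right and the harder item 3 is implied wrong, so no further items would need to be presented.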

The individual's row of the S $\oplus$ Implied Wins matrix is multiplied with
the S' $\oplus$ Implied Losses matrix to give the individual's vector of integers
representing the number of times the person dominates other people.

The individual's column of the S' $\oplus$ Implied Losses matrix is multiplied
with the S $\oplus$ Implied Wins matrix to give the individual's vector of integers
representing the number of times the person is dominated by other people.
After the row and column of $P \oplus P^2$ are tested against each other for
significant dominance, the binary results are used in conjunction with the
individual's vectors in S $\oplus$ Implied Wins and S' $\oplus$ Implied Losses to derive
the individual's net dominance score.


The individual's net dominance score is his row total minus his column
total. Item scores from the last test are altered by the individual's entries
in S $\oplus$ IRA and S' $\oplus$ IWA. The only way the process differs from complete
calculations is in not updating the entries of $I \oplus I^2$ until the test is
finished.

TAILOR has been evaluated in the past with both Monte Carlo responses
(Technical Report #4) generated according to Birnbaum's model (Lord & Novick,
1968) and with test simulation using response matrices from previously
administered complete tests. By generating responses from formulae it was possible
to select levels of item discrimination, ability, difficulty, test length and
other parameters with a precision and flexibility that real testing doesn't
allow. Also, it is a good deal easier to arrange for a thousand simulated
examinees than real ones. The reason for this experiment, the collection of
data from real people, was to make sure the program which had been developed
with artificial data would work as well with the real thing.
The current study was designed to compare tailored test reliability with
the reliability of complete tests given under comparable conditions. The
reliability of tailored scores combined with degree of test shortening would
then demonstrate the measurement efficiency of TAILOR-APL.

Method 

Design 

Fifty subjects were tested in the spring of 1977. Half took a complete
test administered by computer and half took a tailored test using the same
item pool. Another fifty subjects were split between the experimental tailored
test condition and the complete test condition and tested in the summer of 1977.

In all, one hundred tests were given. Subjects were randomly assigned to
the two conditions. Except for the number of items and the order in which they
were presented, the tailored condition was identical to the complete test
condition.

Due to the size constraints of the APL system at USC, the second group
of tailored subjects did not take advantage of the stored information about
item dominance provided by the first group. The second group, like the first,
began with no item information.

In order to obtain a measure of the reliability of tailored and
non-tailored tests the item pool of 50 anagrams was divided randomly into two sets
of items. These items were presented in an odd-even fashion: first an item
from set one, then an item from set two. This picture was complicated
slightly in the case of the tailored test. Because the length of the tailored
tests cannot be predicted, the tailored halves were administered odd-even
until one test was completed and then all remaining questions were from the
unfinished test.
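
The odd-even administration rule can be sketched as a simple interleaving of two item streams of unknown length. This is an illustrative assumption about how such a schedule might be coded, not the APL routine itself; the function name and labels are hypothetical, and each half test is modelled as any Python iterable that stops when that half is finished.

```python
def interleave(test_a, test_b):
    """Alternate between two half tests; once one is finished, take all
    remaining items from the other."""
    streams = [iter(test_a), iter(test_b)]
    labels = ["A", "B"]
    active = [True, True]
    order = []
    turn = 0
    while any(active):
        if active[turn]:
            try:
                order.append((labels[turn], next(streams[turn])))
            except StopIteration:
                active[turn] = False   # this half has no items left
        turn = 1 - turn                # hand the turn to the other half
    return order


# Hypothetical case: half A ends after two items, half B needs four.
schedule = interleave(["a1", "a2"], ["b1", "b2", "b3", "b4"])
```

In a tailored test each stream would be a generator that picks its next item adaptively, which is why the lengths cannot be fixed in advance.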

Because the summer subjects didn't make use of item information gathered
in the spring and because information about ability or difficulty is not
shared between the split halves of the tailored test condition, there is a
total of four tailored response matrices. In the presentation of results these
four cases (and the corresponding cases for complete test data) are handled
separately or merged, when appropriate, for the various analyses. The four
matrices represent responses of the first and second 25 subjects to the A and
B item pool halves.

Reliability was chosen as the principal criterion because it evaluates
the tailored test and the complete test independently, unlike the criterion
of tailored test correlation with complete test.

Correlation with complete test scores is appropriate only if we assume
the items are independent and the answer an examinee gives to an item is not
affected by the previous items presented. If that is true, a tailored test
is simply a shorter and therefore less reliable version of the complete test.
Using reliability allows for the possibility that tailored measures may better
reflect the underlying proficiency being evaluated.

The individual testing version of TAILOR also allowed a second type of
analysis to be performed. Because tailoring in the individual testing version
is accomplished only to the extent that item information has accumulated from
previous tests, the first tests include all the items, and subsequent tests
show progressively greater influences from the tailoring procedure. The data
therefore allow a regression analysis to be done using the order of
administration as an independent variable which ranges from a complete test to the
most tailored. If a significant trend toward higher or lower scores, more
or less reliable scores or greater or less variance occurred, this could be
detected by the regression analysis.

Subjects 

Subjects were 100 students from the introductory psychology classes at
the University of Southern California.

Items 

Fifty unique solution anagrams were used in experimental and control
conditions. The anagrams were taken from various sources and had either four
or five letters (see Appendix A). Random arranging of the item order as well
as the letter order was done by a separate APL program written by the
investigator.

No information was gathered concerning item difficulty or discrimination
before the experiment. The reason for using the anagrams was the ease of
scoring answers by computer, rather than the existence of any statistical
properties which would facilitate tailoring.


Procedure 

Subjects were told the experiment was an evaluation of tailored testing
and that they would be required to solve scrambled word problems presented by
the computer at a typewriter style terminal. After answering any questions
they had about the experiment and watching the first anagram appear, the
experimenter left the room. Each anagram had a 30-second time limit. The time
limit was the experimenter's estimate of a reasonable cutting point and was
not based on any prior testing. When the test was finished a message was
presented by the program telling the subject to notify the experimenter.

Scores 

Three types of test scores are used. First, a conventional number correct
was used for the complete test condition. Second, a net dominance score was
used internally for matching persons and items in the tailored tests and is
also used in most of the analyses. Net dominance scores, as described earlier,
involve subtracting the total number of elements (items and persons) which
dominate a particular individual (or item) from the total of the elements which
are dominated by that individual (or item). Third, in addition to net
dominance scores, a score similar to the conventional number correct was computed
for tailored test subjects to compare the tailored test score distributions
to complete test scores. The difference between this second tailored test
score and a simple correct answer score is that for a tailored test the score
includes implied correct answers as well as .5 times the instances where an
item is neither implied nor actually presented. Such missing entries are
rare and no more than one ever occurred for a given individual.
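
The third score can be illustrated with a short sketch. The coding of an examinee's record used here ("R" and "W" for observed right and wrong answers, "r" and "w" for implied ones, `None` for an item neither presented nor implied) is a hypothetical convention chosen for the example, as is the function name.

```python
def tailored_number_correct(codes):
    """Observed plus implied correct answers, plus .5 per missing entry."""
    observed_right = codes.count("R")
    implied_right = codes.count("r")
    missing = codes.count(None)
    return observed_right + implied_right + 0.5 * missing


# Hypothetical six-item record: one observed right, two implied right,
# one observed wrong, one implied wrong and one missing entry.
example_score = tailored_number_correct(["R", "r", "r", "W", None, "w"])
```

Crediting a missing entry with .5 treats it as an even chance of a correct answer, which keeps the score on the same scale as a conventional number correct.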

Results 

The obtained distributions of raw scores on both parallel forms of the
anagrams test are displayed in Figures 3 and 4 for complete and tailored
conditions. In addition to the net dominance tailored scores, scores calculated
similarly to conventional scores, from the persons x items submatrix, are also
shown to allow a visual comparison of tailored and complete test distributions.
The two sets of tailored scores are not just linear transformations, however,
and the reliability of the net dominance scores is slightly higher.

The means obtained from raw tailored scores are 0.0800 and 2.3000 for the
A and B parallel forms. Standard deviations for these distributions are
20.283 and 23.592. The two distributions are not significantly different by
t-test (α = .6150). The means of the recomputed tailored scores are 12.540
and 13.650. The corresponding means of the complete test forms are
12.540 and 12.640. Neither the difference between form A, complete and
tailored, nor form B, complete and tailored, was significant (α = 1.000,
α = .2669). A comparison of A and B forms combined, giving 100 tailored
scores and 100 complete test scores, showed means of 13.095 and 12.590 for
tailored and complete test scores (α = .3669). Standard deviations for
these combined scores are 4.29 (tailored) and 3.58 (complete).

Figure 3: [caption partly lost] ...s on Form A and Form B along with aggregated
data and tailored [...] persons by items matrix.

Figure 4: Histograms showing tailored score distributions using net dominance
scores (panel shown: Tailored Scores, Form A).

Table 1 shows the correlations between individuals' scores on form A
and form B of both tailored and complete tests. It also shows item score
correlations between first and second batches of 25 examinees in each of
the two conditions. Above the diagonal are Kendall's TauB's and below the
diagonal are Pearson r's.

For fifty individuals the parallel forms reliability of tailored scores
is .83 (TauB = .65). For another fifty individuals, the complete test
reliability is .78 (TauB = .61). A test of the significance of the difference
in Pearson r's according to Fisher's z transformation (Hays, 1973) gives an
alpha of .52 which is clearly not significant. The 95% confidence intervals
for these correlations are .71 to .90 (tailored) and .64 to .87 (complete).
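
The Fisher z computations can be reproduced approximately from the rounded figures given above. The sketch below is a standard textbook calculation rather than the program actually used for the report; the function names are hypothetical and a normal critical value of 1.96 is assumed for the 95% interval.

```python
import math


def fisher_ci(r, n, z_crit=1.96):
    """Confidence interval for a correlation via Fisher's z transformation."""
    z = math.atanh(r)
    half_width = z_crit / math.sqrt(n - 3)
    return math.tanh(z - half_width), math.tanh(z + half_width)


def fisher_difference(r1, n1, r2, n2):
    """z statistic and two-tailed probability for r1 versus r2
    (independent samples)."""
    standard_error = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (math.atanh(r1) - math.atanh(r2)) / standard_error
    normal_cdf = 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0)))
    return z, 2.0 * (1.0 - normal_cdf)


tailored_ci = fisher_ci(0.83, 50)
complete_ci = fisher_ci(0.78, 50)
z_stat, p_value = fisher_difference(0.83, 50, 0.78, 50)
```

With the two-decimal correlations the intervals agree with those reported to within rounding, and the two-tailed probability comes out at roughly .49, close to the reported .52; the small gap presumably reflects the unrounded correlations used in the original analysis.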

Tailored scores computed from the persons x items matrix gave a reliability
of .79.

Item score correlations were slightly higher for complete tests. In the
A item pool, item scores correlated .90 (TauB = .72) and .88 (TauB = .64) for
complete and tailored scores respectively, in the B pool .83 (.75) and .79 (.58).
These differences are also non-significant.

Figures 5 and 6 are graphic comparisons of tailored and complete test
performance. The two rows of the two by three figures represent the parallel
forms. Figure 5 represents the first 25 subjects in each condition and
Figure 6 the second 25. The first matrix in each row shows the right and wrong
answers, represented by boxes and stars respectively, to questions which were
actually presented to subjects. The second matrix in each row shows the actual
responses and the responses which were implied. The blank spaces in column
two represent implications which were revoked on the basis of information
provided by later subjects, except for the final subject whose own responses
revoked an implied answer when the $I \oplus I^2$ matrix was revised at the end of his
test. The final matrix of each row is the right and wrong answers observed
in each of the complete tests.

Figure 6: Response Matrices (Second 25 Subjects)

The  rows  and  columns  of  each  matrix  are  ordered  by  the  person  and  item 
scores  from  that  matrix  only,  so  rows  and  columns  are  not  comparable  between 
matrices. 

Figure 7 shows the observed correct and incorrect answers from the tailored
tests arranged chronologically. Items are arranged according to difficulty.

The top row of each matrix represents subject one's performance. Row two
represents subject two and so on. The top two matrices are the A and B items
and the first 25 subjects. The lower matrices represent the second 25 subjects.

From these illustrations we can see that the goal of clustering
observations around an individual's ability level has been generally satisfied. It
is also apparent that the longer vectors of observations were those gathered
early in the experiment when the available information would not allow more
extensive tailoring.

One  of  the  most  striking  differences  between  tailored  and  complete  tests 
is  the  greater  uniformity  of  the  second  column  of  matrices  in  Figures  5 and  6 
compared  to  the  third  column.  The  two  regions  of  right  and  wrong  answers  are 
more  cleanly  separated  for  the  tailored  tests. 

This lower degree of intermingling is a graphic display of two combined
effects. To a certain extent the separation is artificially induced by the
implication process which does not allow the implied responses to show any
random expression of low probability events. The item which has only a .05
chance of being answered correctly is unlikely to be actually presented to an
individual and scored as it would be in a complete test. This is not the entire
explanation, however. If only the observed responses in the matrices of
column one are compared to corresponding segments of the complete test matrices,
it is apparent that greater consistency exists among observed scores as well.

Figure 7: Observed Responses in Each Tailored Test Arranged
Chronologically and According to Item Difficulty

This  phenomenon  is  again  illustrated  in  Figure  8. 

Figure 8 is a plot of the percent of items answered correctly according
to their distance from a person's .5 level of ability. Each person's vector
of correct answers was arranged so his .5 level was aligned with zero on the
abscissa. The plotted points at -3 represent the number of times the third
item below the .5 level of each individual's ability was answered correctly
divided by the number of times such an item was asked. The number of times an item
is asked at each level is not always 50. In a complete test, the item would
not be available to persons whose ability level was so close to the bottom of
the range that an item three steps lower was not available. In the tailored
test it could also be the case that the item was not available because the
outcome had already been implied. The curves represent only observed correct
answers.
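
The tally behind such a plot can be sketched as follows. How each person's .5 level was located is not described here, so the sketch simply takes it as a given index into that person's response vector; the data layout (responses ordered from easiest to hardest item, `None` for an item not asked) and the function name are hypothetical.

```python
from collections import defaultdict


def proportion_correct_by_offset(records):
    """Proportion of observed correct answers at each item offset from the
    person's .5 level.

    `records` is a list of (responses, level) pairs; responses hold 1 for
    right, 0 for wrong and None for items that were never presented.
    """
    asked = defaultdict(int)
    correct = defaultdict(int)
    for responses, level in records:
        for position, answer in enumerate(responses):
            if answer is None:
                continue                  # only observed answers are counted
            offset = position - level     # negative: easier than the .5 level
            asked[offset] += 1
            correct[offset] += answer
    return {d: correct[d] / asked[d] for d in sorted(asked)}


# Two hypothetical examinees, each with a .5 level at position 2.
curve = proportion_correct_by_offset([
    ([1, 1, None, 0, 0], 2),
    ([0, None, 1, 0], 2),
])
```

Because the denominator is the number of times an item at that offset was actually asked, it differs from offset to offset, as noted above.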

The ranges of items between a subject's .10 and .90 probabilities of
answering correctly are quite different and summarize the message of the plots. In
the tailored tests a range of seven items stretches across the interval from
.10 to .90. In a complete test the same interval requires seventeen items.

This difference is also expressed by the consistency indices computed for the
tailored and complete test observed response matrices. The average value
of the index (Cliff, 1977) for tailored tests is .42. The average for the complete
tests is .16.

Figure 9 shows the number of questions asked, plotted as a function of
the number of tests given. An average of forty-four percent of the questions
were presented in each tailored test. The range and mean are presented for
each position in the order.

The curve in Figure 9 appears to asymptote near eight items after fifteen
tests have been administered. The average of the last ten tests in all four
tailored conditions is 7.975 items asked. If we continued to give the anagrams
test we could expect to present an average of 32 percent of the items.
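
As a check on the arithmetic, and assuming that each half of the pool contains 25 items (the 50 anagrams divided into two sets), the asymptotic proportion follows directly from the average number of items asked:

\[
\frac{7.975}{25} = 0.319 \approx 32\%
\]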

As  previously  mentioned,  the  progression  of  the  tailoring  process  from 
a complete  test  to  about  eight  items  allows  us  to  use  serial  position  as  an 
independent  regression  variable  representing  the  extent  of  tailoring. 

Regressions were done with z scores, absolute values of z scores and
the difference of z scores on forms A and B as the dependent variables. It
was intended that a trend in z scores would show if tailoring induced higher
or lower test scores. The correlation obtained was a non-significant -.067
so details of the regression are not presented. The regression of absolute
z scores was intended to detect any change in variance that might occur because
of tailoring. This correlation was 0.000. The differences between z scores
were analyzed to show any tendency for tailoring to change the reliability
of the test. The correlation again was non-significant (r = .114). Thus
scores from later subjects based on 8 items seem to be as reliable as scores
from earlier subjects who were given much longer tests.
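
The three trend analyses amount to correlating serial position with three derived variables. The sketch below shows one way this could be set up; it is an assumption-laden illustration, since the report does not say how the z scores of the two forms were combined or whether the form difference was signed. Here the form A z score is used for the first two analyses and the signed A minus B difference for the third, and the data are invented.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def trend_correlations(z_a, z_b):
    """Correlate order of administration with level, spread and form
    difference of the z scores (lists are in order of testing)."""
    order = list(range(1, len(z_a) + 1))
    return {
        "level": pearson_r(order, z_a),
        "spread": pearson_r(order, [abs(z) for z in z_a]),
        "form_difference": pearson_r(order, [a - b for a, b in zip(z_a, z_b)]),
    }


# Invented z scores for four examinees in order of testing.
trends = trend_correlations([1.0, 0.0, -1.0, 0.5], [0.5, 0.0, -1.0, 1.0])
```

A correlation near zero in each case would correspond to the finding that neither score level, variance nor reliability changed as tailoring became more extensive.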

Figure 10 shows the actual branching of two examinees within the A and
B item pools. The items were ordered from easiest to hardest, left to right.
[...] score is closest to zero. Plusses and minuses in the chain represent correct
and incorrect answers as the test progresses. After each response, both
items and persons receive new scores and the examinees are assigned to the
items which best match their net dominance scores and which haven't been
given or implied yet. At the end of each test all the items have either
implied answers or have actually been given and scored.

Figure 8: Proportion of correct responses in Tailored and Complete tests for items of
a particular distance from the individual's estimated ability.

Figure 10: Examples of two individuals being routed through the
A and B item pools by TAILOR-APL

From the figure it is possible to see that the size and direction of
the branching step is variable. Wrong answers are not necessarily followed
by easier items or correct answers by harder items although that is, in
general, the case.

Although this pair of examinees took many more items and received much
more extreme scores in the B item pool than in the A item pool, this is not
generally true to such an extent. The B items over all subjects involved
only 3% more items in an average test and the standard deviations of B and
A scores were 23.592 and 20.283 respectively.

Discussion 

Based on previous evaluations of TAILOR in the group testing form
with Monte Carlo techniques and simulation testing, there was reason to expect
the individual testing version to produce reliabilities slightly less than
complete tests and reduce the items presented to a half or a third depending
on the quality of the item pool.

The overall level of tailoring (.44 of the items presented) and the
asymptotic value (.32) were, therefore, not unexpected, but the speed with which
tailoring took place in terms of examinees was very surprising. In each of
four tailored tests only two examinees had to answer 25 items. After 15
subjects had been tested the program seemed to reach an asymptotic value. This
contrasts rather startlingly with tailoring techniques that require 100-150
complete tests (in the case of Reckase's one parameter method), up to several
thousand recommended frequently for the more complex procedures (summary in
Reckase, 1977).

Had TAILOR-APL produced scores not significantly less reliable than
complete test scores, the program would have done what could reasonably be expected
of a test tailoring method, but again TAILOR-APL outperformed our expectations.
Though Fisher's z transformation fails to show a significant increase in
reliability, the overall reliability is necessarily a compromise between the
shortening of the test and what appears to be an increase in the discrimination
of the items. If the remaining items had been added to the end of the tailored
test the difference in reliabilities might have been significant.

Although the data in Figure 8 seem to be dramatically different for items
in the tailored and complete test condition, the writer confesses his inability
to test the difference in slope for significance.

For some reason, the same anagrams when presented in a tailored test are
delineating more precisely between ability levels. There are reasons why a
tailored test might provide more reliable measurement than a conventional test.
By presenting items close to a person's ability, guessing is not encouraged
as it would be if the items were too difficult, nor are examinees likely to
become bored and careless as they might if the test was felt to be too easy.

It is possible that by requiring a steady effort from the examinee his
behavior is made more consistent.

An additional influence may be TAILOR's tendency to begin each test with
items either too difficult or too easy and slowly drift past the examinee's
ability level. Consistency could be induced by the elevating or depressing
effects of early items on later performance.

If we assume that as long as an examinee is performing well his responses
to test items are better than they would normally be, he will tend to pass
his usual level of performance before making an error. When an error is
finally made the elevating effect is ended and items beyond that point are
consistently failed as they would have been normally. Since not all items
presented after the initial failure will be above the normal ability level,
the discrimination will not be perfect, but the tendency of this process will
be to increase discrimination. A similar effect could be hypothesized in the
opposite direction due to a depressing effect caused by too difficult items.
To distinguish between these hypotheses a tailoring system which presented
items near a person's ability level, but not in order of difficulty, would
have to be compared to the present system.

There are three issues concerning the operation of TAILOR-APL which are
beyond the current investigation and which may be important in the future.

The first two are potential difficulties which were not troublesome in the
current study, but which may cause problems in a new application. The third
is a potential difficulty in other tailoring programs which will illustrate
some of the unique benefits of basing the tailoring process on information
gathered during the administration of tailored tests.

There is a possibility that if the items are not as evenly divided by
the mean ability of the examinees as they are here, giving a new examinee
the item whose net dominance score is closest to zero will not be the item
closest to a .50 probability of being answered correctly. Because the item
score includes item dominance information as well as person dominances, an
item which has a net person dominance of zero will be offset from zero
according to its item dominances.

This problem is peculiar to the individual testing version of TAILOR,
TAILOR-APL. The original formulation of the technique (Cliff, 1975) was for
the group testing version in which both items and persons collect dominances
simultaneously so items will not be offset by the results of previously
administered tests. Any item dominances will be equally attached to person and item
scores as the test progresses.

Another potential problem in the operation of the program concerns the
treatment of poorly discriminating items. The program is designed to continue
a test until all item outcomes are implied or observed through actual
administration. A poorly discriminating item is the least likely item to be implied
since it has the fewest reliable dominance relations with other items. The
result could be that the worst items in the item pool are asked most often
simply because they have few of the relations to the rest of the test which
would allow them to be implied.
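A small Python sketch can make this concern concrete. It is a deliberately simplified stand-in, not the TAILOR-APL procedure: it assumes deterministic, transitive "harder than" relations between items, administers items in a fixed order, and uses hypothetical names (`run_test`, `harder_than`, the item labels) throughout. An item with no dominance relations can never be implied, so it is administered to every examinee.

```python
def run_test(items, harder_than, answer):
    """Administer items until every outcome is observed or implied.

    harder_than: set of (hard, easy) pairs treated as reliable dominance relations.
    answer: dict mapping item -> bool, the response the examinee would give.
    Returns (outcomes, administered).
    """
    # Transitive closure: if a is harder than b and b is harder than c,
    # then a is harder than c.
    closure = set(harder_than)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True

    outcomes, administered = {}, []
    for item in items:  # fixed order for simplicity (an assumption)
        if item in outcomes:
            continue  # outcome already implied, so the item is skipped
        correct = answer[item]
        administered.append(item)
        outcomes[item] = correct
        for hard, easy in closure:
            if correct and hard == item:
                outcomes.setdefault(easy, True)    # passing implies easier items passed
            if not correct and easy == item:
                outcomes.setdefault(hard, False)   # failing implies harder items failed
    return outcomes, administered


items = ["q3", "q5", "q1", "q2", "odd"]
relations = {("q3", "q2"), ("q2", "q1"), ("q5", "q3")}  # "odd" has no relations

able = {"q1": True, "q2": True, "q3": True, "q5": False, "odd": True}
weak = {"q1": True, "q2": False, "q3": False, "q5": False, "odd": False}

_, asked_able = run_test(items, relations, able)
_, asked_weak = run_test(items, relations, weak)
```

Under these assumptions the unrelated item "odd" appears in both administered lists, while the well-connected items are often skipped because their outcomes are implied.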

The mechanism for eliminating poorly discriminating items has not yet
been implemented, but the derivation of indices which will be necessary for
this task has been accomplished (Cliff, 1977).

A grooming of the item pool to eliminate ability/difficulty offset could
also be included in the subroutine which removes non-discriminating items. Once
indices are computed and the mean ability level is known with reasonable accuracy,
both tasks could be accomplished by little more than visual inspection or a
single cutoff value.


duced in parameters estimated from complete test data.


First there are errors introduced because of order effects among the
items. Since the order in which items are presented in tailored tests differs
systematically from the order in which they are presented in a complete
test, any change in item discrimination or difficulty caused by the order in
which items are presented during tailored testing will result in inaccuracy
and inefficiency in tailoring systems which rely on accurate parameter
estimates.

The greater discrimination of the items in the current study when
presented in a tailored test is evidence that such order effects do occur. If
the higher discrimination is a general characteristic of tailored tests,
there will be other studies which will find higher reliabilities in tailored
tests than in complete tests. Tailored test reliability can only be higher
than complete test reliability if the items are not independent of each other.
If the examinee's history of correct and incorrect answers is affecting the
probability of his correctly answering the current item, then the model which
says the probability of a correct response to a particular item is purely a
function of that examinee's true score is not accurately describing the
examinee's behavior. Because the current data are not a statistically
significant example of higher reliability, they must by themselves remain only
suggestive.

Killcross (1976) in his review of tailored testing recognized the
importance of item order effects and suggested an index of context reliability be
devised. He reviews nine studies which looked at such things as whether
students would score higher on a test which was given in ascending item
difficulty or descending difficulty, whether overall test variance and mean
difficulty were accurately reflected by data from small item samples, and whether
success or failure on an item caused the next item to be harder or easier.
None of the studies, apparently, used speeded items, items with knowledge of
results provided, or subsets of items ordered in difficulty or matched to the
examinee's ability. The only positive finding was that items in a
'quantitative thinking' test became significantly easier when the test was reduced to
one fourth its size.

Although some elements of the way items were presented in tailored
testing were looked at in these studies, they were never looked at all together
or even all separately, nor was discrimination used as a criterion. There
are indications from tailored testing research done elsewhere that order
effects are operating.

Waters (1976) reported higher validities for "stradaptive" tests with
19, 25 or 31 vocabulary items compared to a conventional test of 48 items.
The number of items in a "stradaptive" test is to some extent deceptive in
this case because an initial graded response item is used to determine the
individual's entry point. This may have an effect equivalent to several
items administered within the test itself. The graded response item is
unlikely to have an effect equal to the 10 or 20 items which represent the
differences between the tailored and the complete tests.

Waters obtained correlations with a criterion of .499, .536, and .536 for
the tailored tests. The correlation between the complete test and the
criterion was .477.

Not all tailoring techniques would be adversely affected by such
alterations in discrimination parameters, if they become a familiar result of
tailoring. Certainly they had no harmful effects on the "stradaptive" procedure.
Alteration of other item characteristics, however, seems even more likely. A
series of investigations at Minnesota (Betz & Weiss, 1976a, 1976b; Weiss,
1975) shows that when knowledge of results is provided during tailored testing
dramatic changes occur in item difficulty. Since it is impossible for
knowledge about the correctness of one's answer to affect performance on the item
for which it is given, this necessarily means the items are not independent.

Aside from the problem of item parameters changing when the items are
moved from a conventional test to a tailored test, there is the additional
problem of the appropriateness of the parameters and standards for any
particular sample. It may be argued that changing a test for every group that
takes it destroys the soundness of the measurements as standardized evaluations.
In fact, however, careful re-evaluation of items is necessary for the same
reasons that the initial item analysis takes place during construction of the
test. A score on the Stanford-Binet taken in 1960 can be as much as ten IQ
points away from the same score achieved by a person of identical age in
1972 (Sattler, 1974). A vocabulary item which discriminates well when given
in an intelligence test in one population may give poor discrimination when
administered to people in another.

The sheer burden of collecting hundreds or thousands of complete test
responses to re-parameterize a tailored test makes the tailoring methods which
gather their information from complete test data more prone to being
administered with parameters that are out of date or inappropriate, assuming, of course,
that appropriate parameters can be derived from complete test data to begin
with.

Pretesting is not only an expensive undertaking (Reckase, 1977) which
would make tailored testing infeasible for tests which will not be given
to thousands of examinees, but it also jeopardizes item security and requires
a double standard in testing. Some examinees get a tailored test and others
have to spend time with a longer examination and face the possibility of a less
reliable score.

In contrast to the pretesting tailoring methods, TAILOR quickly adapts
to any changes in person-item relations. It can also be "frozen" with a
particular reference sample if the user wishes. Item security is not
jeopardized by using the same item pool year after year, nor is the operation of
the procedure jeopardized by "knowledge of results" if the tester chooses
to give his students feedback.

Program  Availability 

Copies of the APL program used for this research are available from
Douglas  McCormick,  Psychology  Department,  University  of  Southern  California, 
University  Park,  Los  Angeles,  California,  90007. 

Conclusion 

This first empirical test of TAILOR-APL with 50 live examinees has shown
that a test can be reduced to 44% of its original length with no pretesting
of the items and no significant loss in test reliability when comparison is
made to a complete test administered under comparable conditions.

Footnote


Gilhooly, K. J. and Hay, D. Imagery, concreteness, age-of-acquisition,
familiarity and meaningfulness values for 205 five-letter words

Appendix A: Anagrams and Solutions


Set A                     Set B
Anagram   Solution        Anagram   Solution

sifh      fish            hymer     rhyme
ihpmc     chimp           naldg     gland
dtovi     divot           owrk      work
ofctr     croft           abclk     black
ilyrc     lyric           tnoek     token
ucont     count           ilpa      pail
firya     fairy           culnh     lunch
gryol     glory           uavlt     vault
pleh      help            odnf      fond
ihbrc     birch           ecnbh     bench
paonr     apron           nevah     haven
hglit     light           albez     blaze
letl      tell            knale     ankle
opitv     pivot           ihra      hair
ibta      bait            nogme     gnome
kool      look            ibrot     orbit
orfo      roof            lsed      sled
htpde     depth           rlveo     lover
fkeni     knife           carhi     chair
enuco     ounce           goibt     bigot
oolw      wool            epil      pile
ubdto     doubt           evcor     cover
mdone     demon           krnac     crank
tlave     valet           bdran     brand
omnaw     woman           enog      gone